Loading the MASS “Boston” dataset already loaded in R. Link to dataset
This is a data set about Housing Values in Suburbs of Boston.
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. < em >J. Environ. Economics and Management < b >5, 81???102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) < em >Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley
Variables in the dataset:
library(MASS)
data("Boston")
dim(Boston) #506 observation of 14 variables
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Names of the variables in the data
506 observation of 14 variables
gather(Boston) %>%
ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Visualisation of the variables. the variable rm is the only one looking normally distribiuted, the others seem rather skewed.
library(tidyr)
library("ggplot2")
library("GGally")
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix
p1
?ggcorr
p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)
p2
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Looking at the correlations plot we can see that positively correlated variables are; rad (index of accesibility to radial highways), lstat (lower status of the population, percent) tax (full-value property-tax rate per $10,000) to crim (our variable of interest is crimerate per capita by town). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).
scaled_boston <- scale(Boston)
summary(scaled_boston)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
crim_vector <- quantile(scaled_boston$crim) #new crime variable
crim_vector
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8)
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404 14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102 14
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2524752 0.2376238 0.2500000
##
## Group means:
## zn indus chas nox rm
## low 1.0255060 -0.9704232 -0.084848104 -0.9033402 0.4853065
## med_low -0.1402296 -0.2150198 -0.002135914 -0.5445225 -0.1693487
## med_high -0.3791541 0.2207722 0.342842844 0.4184796 0.1877371
## high -0.4872402 1.0171306 -0.155385496 1.0156493 -0.3928745
## age dis rad tax ptratio
## low -0.9127882 0.9197667 -0.6865511 -0.7229099 -0.48447729
## med_low -0.3031989 0.2775597 -0.5585147 -0.4459476 -0.06595478
## med_high 0.4419249 -0.4235225 -0.3741407 -0.2850365 -0.45339490
## high 0.7969423 -0.8480999 1.6379981 1.5139626 0.78062517
## black lstat medv
## low 0.37655659 -0.77203627 0.5753937
## med_low 0.30928750 -0.05814543 -0.0462062
## med_high 0.07180179 -0.01150049 0.2550062
## high -0.74199322 0.97553322 -0.7596165
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09018311 0.768620075 -0.76721089
## indus 0.05899202 -0.406226484 0.35021226
## chas -0.09859431 -0.103881331 -0.01074262
## nox 0.35974494 -0.549198616 -1.38752235
## rm -0.10804618 -0.091099853 -0.10700659
## age 0.29884902 -0.287099988 -0.23657261
## dis -0.04519730 -0.228077340 -0.04805085
## rad 3.09585371 0.768304680 -0.16273782
## tax -0.10581076 0.071867706 0.48226534
## ptratio 0.05227493 0.180134604 -0.18895075
## black -0.13696293 -0.005543041 0.11817477
## lstat 0.21483170 -0.139471695 0.39260667
## medv 0.16769680 -0.295003538 -0.29847084
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9371 0.0459 0.0170
classes <- as.numeric(train_set$crime) # 1,2,3,4 in stead of low,med_low,med_high, high
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)
##Predict with LDA
# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(correct = lda.pred$class, prediction = correct_classes)%>%addmargins()
## prediction
## correct low med_low med_high high Sum
## low 9 5 1 0 15
## med_low 11 19 15 0 45
## med_high 2 0 14 0 16
## high 0 0 0 26 26
## Sum 22 24 30 26 102
Prediction was right in 56% of test set
high crime was prdicted in 16/18 cases
med_high crime was predicted in 18/24
high crime was predicted in 26/27
library(MASS)
data("Boston")
scaled_boston <- scale(Boston)
#determining the number of clusters
wss <- (nrow(scaled_boston)-1)*sum(apply(scaled_boston,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
km<- kmeans(scaled_boston,3)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
## Length Class Mode
## cluster 506 -none- numeric
## centers 70 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
3 clusters would be appropriate.
pairs(scaled_boston, col=km$cluster)
data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506 14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.1047431 0.4031621 0.2529644 0.2391304
##
## Group means:
## crim zn indus chas nox rm
## 1 1.9480908 -0.4872402 1.0149946 -0.272329068 1.0357640 -0.6389486
## 2 -0.3883281 -0.3346480 -0.4362541 -0.002135914 -0.4591933 -0.2307927
## 3 0.1886274 -0.4872402 1.2111512 0.127532675 1.0860073 -0.3665114
## 4 -0.3981339 1.2930469 -0.9902994 -0.012024920 -0.8283387 1.0566896
## age dis rad tax ptratio black
## 1 0.8260197 -0.9301775 1.6596029 1.5294129 0.80577843 -2.13287198
## 2 -0.1783550 0.2117619 -0.5770929 -0.6174348 0.09480641 0.31558148
## 3 0.8044599 -0.8146389 0.7955609 0.9872011 0.34603822 0.04414302
## 4 -0.9121115 0.9121798 -0.5955687 -0.6732556 -0.87884014 0.35548171
## lstat medv
## 1 1.4464782 -1.18461968
## 2 -0.1423656 -0.09135041
## 3 0.5301732 -0.40331922
## 4 -0.9544043 1.09954700
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.31028992 0.62224451 0.597212526
## zn -0.09338930 0.96434025 -0.716986122
## indus -1.40992137 -0.16185691 -1.004128879
## chas 0.06950551 -0.12316233 0.073570823
## nox -0.48313830 0.10654440 -0.870736311
## rm 0.25553149 0.52639418 -0.282378921
## age -0.07134757 -0.31940794 -0.045053040
## dis 0.06345902 0.01924507 -0.380206568
## rad -0.85654373 0.09761292 0.003307117
## tax -0.37874646 0.14644700 -0.545068105
## ptratio -0.01835256 -0.12874013 0.128579267
## black 0.56785879 -0.68819820 -0.950512993
## lstat -0.09283319 0.55997738 0.261272880
## medv 0.20180072 0.60601649 -0.347080865
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7798 0.1174 0.1029
classes1 <- as.numeric(km_bonus$cluster)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)
##superbonus
model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda1$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)